Different people speak with diverse, personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to obtain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style using another piece of audio. Specifically, we first develop a style encoder to extract the dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and the style code. To integrate the reference speaking style into the generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of its feed-forward layers accordingly. Thanks to this style-aware adaptation mechanism, the reference speaking style can be better embedded into the synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
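The style-aware adaptation idea (a style code rescaling the decoder's feed-forward weights) can be sketched in a few lines. The following is a minimal NumPy illustration under assumed shapes and an assumed channel-wise modulation form, not StyleTalk's exact parameterization:

```python
import numpy as np

def style_adaptive_ffn(x, style, W1, b1, W2, b2, S1, S2):
    """Feed-forward layer whose weights are modulated by a style code.

    x:      (T, d)  content tokens
    style:  (s,)    style code from the style encoder
    W1, W2: base feed-forward weights; S1, S2 map the style code to
            per-output-channel scale offsets (illustrative form).
    """
    scale1 = 1.0 + np.tanh(style @ S1)         # (h,) channel-wise rescale
    scale2 = 1.0 + np.tanh(style @ S2)         # (d,)
    h = np.maximum(x @ (W1 * scale1) + b1, 0)  # ReLU on style-adjusted weights
    return h @ (W2 * scale2) + b2

rng = np.random.default_rng(0)
T, d, h, s = 5, 8, 16, 4
x = rng.normal(size=(T, d))
style = rng.normal(size=(s,))
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = rng.normal(size=(h, d)), np.zeros(d)
S1, S2 = rng.normal(size=(s, h)), rng.normal(size=(s, d))
out = style_adaptive_ffn(x, style, W1, b1, W2, b2, S1, S2)
print(out.shape)  # (5, 8)
```

Because the style code only rescales weights rather than being concatenated to the input, the same content tokens yield different motion statistics under different styles.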
translated by Google Translate
Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL), yet it has been criticized for learning inefficiency. We attribute this to the insufficient utilization of training signals. To alleviate this issue, we introduce a conceptually simple yet learning-efficient MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD). For disjoint masking (DM), we sequentially sample multiple masked views per image in a mini-batch with a disjointness regulation to raise the usage of tokens for reconstruction in each image while keeping the masking rate of each view. For joint distillation (JD), we adopt a dual-branch architecture to respectively predict invisible (masked) and visible (unmasked) tokens with superior learning targets. Rooted in orthogonal perspectives on training efficiency, DM and JD cooperatively accelerate training convergence without sacrificing the model's generalization ability. Concretely, DM can train ViT with half of the effective training epochs (3.7 times less time-consuming) while reporting competitive performance. With JD, our DMJD clearly improves the linear probing classification accuracy over ConvMAE by 5.8%. On fine-grained downstream tasks like semantic segmentation, object detection, etc., our DMJD also presents superior generalization compared with state-of-the-art SSL methods. The code and model will be made public at https://github.com/mx-mark/DMJD.
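The disjoint masking (DM) sampling described above can be sketched directly: draw several masked views per image whose masked token sets never overlap, so each view keeps the usual masking rate while the image as a whole contributes more reconstruction targets. A minimal sketch, assuming a ViT-style flat token grid:

```python
import numpy as np

def disjoint_masks(num_tokens, mask_ratio, num_views, rng):
    """Sample `num_views` masked views with mutually disjoint masked sets.

    Each view masks `mask_ratio * num_tokens` tokens, and no token is
    masked in two views, so more tokens per image receive a
    reconstruction signal. Requires num_views * mask_ratio <= 1.
    """
    per_view = int(num_tokens * mask_ratio)
    assert num_views * per_view <= num_tokens, "views must fit disjointly"
    perm = rng.permutation(num_tokens)
    return [np.sort(perm[v * per_view:(v + 1) * per_view])
            for v in range(num_views)]

rng = np.random.default_rng(0)
views = disjoint_masks(num_tokens=196, mask_ratio=0.5, num_views=2, rng=rng)
covered = np.concatenate(views)
print(len(views), len(covered), len(set(covered.tolist())))  # 2 196 196
```

With a 0.5 masking rate and two views, every token of a 14x14 ViT grid is reconstructed exactly once per epoch, which is the "raised token usage" the abstract refers to.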
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for method development. 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
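For concreteness, the k-fold cross-validation that only 37% of participants performed amounts to the following split scheme (a generic plain-Python sketch, not any participant's specific pipeline):

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k (train, val) folds for cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # near-equal fold sizes
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

splits = kfold_indices(n=10, k=5)
for train, val in splits:
    assert len(val) == 2 and len(train) == 8
    assert not set(train) & set(val)  # validation never leaks into training
print(len(splits))  # 5
```

Each sample appears in exactly one validation fold, so the k per-fold models can also serve as the "multiple identical models" ensemble mentioned above.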
In this paper, we study the problem of real-world image deblurring and consider two key factors for improving the performance of deep image deblurring models, namely training data synthesis and network architecture design. Deblurring models trained on existing synthetic datasets perform poorly on real blurry images due to domain shift. To reduce the domain gap between the synthetic and real domains, we propose a novel realistic blur synthesis pipeline that simulates the camera imaging process. Thanks to the proposed synthesis method, existing deblurring models can be made more robust in handling real-world blur. Furthermore, we develop an effective deblurring model that simultaneously captures non-local dependencies and local context in the feature domain. Specifically, we introduce a multi-path transformer module into the UNet architecture for rich multi-scale feature learning. Comprehensive experiments on three real-world datasets show that the proposed deblurring model outperforms state-of-the-art methods.
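The realistic-blur idea (blur accumulates over the exposure as the camera or scene moves, then sensor noise is added) can be illustrated with a toy pipeline. This sketch substitutes simple horizontal shifts and Gaussian noise for the paper's full camera imaging model; all constants are illustrative:

```python
import numpy as np

def synthesize_blur(sharp, max_shift=3, noise_std=0.01, seed=0):
    """Toy stand-in for a realistic blur synthesis pipeline: average
    shifted copies of a sharp frame (approximating exposure-time motion)
    and add sensor-like Gaussian noise."""
    rng = np.random.default_rng(seed)
    shifts = range(-max_shift, max_shift + 1)
    acc = np.zeros_like(sharp, dtype=np.float64)
    for s in shifts:  # a purely horizontal motion trajectory
        acc += np.roll(sharp, s, axis=1)
    blurred = acc / len(shifts) + rng.normal(0.0, noise_std, size=sharp.shape)
    return np.clip(blurred, 0.0, 1.0)

sharp = np.zeros((8, 8))
sharp[:, 4] = 1.0  # a sharp vertical line
blur = synthesize_blur(sharp)
print(blur.shape, blur[:, 4].mean() < 1.0)  # energy spreads across columns
```

A real pipeline would also model nonlinear camera response, demosaicing, and compression, which is exactly the domain gap the paper targets.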
3D-aware generative models have demonstrated remarkable performance in generating 3D neural radiance fields (NeRF) from collections of monocular 2D images, even for topology-varying object categories. However, these methods still lack the ability to separately control the shape and appearance of the objects in the generated radiance fields. In this paper, we propose a generative model for synthesizing radiance fields of topology-varying objects with disentangled shape and appearance variations. Our method generates deformable radiance fields, which build dense correspondences between the density fields of objects and encode their appearance in a shared template field. The disentanglement is achieved in an unsupervised manner, without introducing extra labels into previous 3D-aware GAN training. We also develop an effective image inversion scheme for reconstructing the radiance field of an object in a real monocular image and manipulating its shape and appearance. Experiments show that our method can successfully learn the generative model from unstructured monocular images and well disentangle the shape and appearance of objects with large topological variance (e.g., chairs). The model trained on synthetic data can faithfully reconstruct the real object in a given single image and achieve high-quality texture and shape editing results.
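The deformation-to-template construction can be sketched structurally: a shape-conditioned deformation maps query points into a shared template space, where density is defined once for the whole category, and an appearance code colors the result independently of shape. Both fields are tiny closed forms here in place of the real MLPs, so this is only a sketch of the factorization, not the paper's networks:

```python
import numpy as np

def query_deformable_field(pts, shape_code, app_code):
    """Structural sketch of a deformable radiance field query."""
    # Deformation field: the shape code bends points toward the template.
    deformed = pts + 0.1 * np.tanh(pts @ shape_code)  # (N, 3)
    # Template density field: a soft unit ball shared across instances.
    density = np.exp(-np.sum(deformed ** 2, axis=1))  # (N,)
    # Appearance field: color driven by the appearance code only.
    color = np.tile(np.tanh(app_code), (len(pts), 1))  # (N, 3)
    return density, color

pts = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
density, color = query_deformable_field(pts, 0.5 * np.eye(3), np.zeros(3))
print(density.shape, color.shape)  # (2,) (2, 3)
```

Because only the deformation depends on the shape code and only the color depends on the appearance code, editing one leaves the other untouched, which is the disentanglement the abstract claims.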
With the rapid development of mobile photography technology, major mobile phone manufacturers are scrambling to improve the shooting capability of their devices and the photo beautification algorithms of their software. However, improvements in smart devices and algorithms cannot replace human subjective photography skills. In this paper, we propose Aesthetic Language Guidance of images (ALG). We divide ALG into ALG-T and ALG-I according to whether the guiding rules are based on a photography template or on a guidance image. In both ALG-T and ALG-I, guidance is given in terms of three attributes: color, lighting, and image composition. The differences in these three attributes between the input image and the photography template or guidance image are described in natural language, yielding the aesthetic natural language guidance. In addition, because of the differences in lighting and composition between landscape and portrait images, we divide input images into landscape images and portrait images, and ALG-T and ALG-I provide aesthetic guidance for the two types of input images respectively.
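The core ALG-I step (measure the attribute gap between the input and the guidance image, then phrase it in natural language) can be sketched with toy statistics. The thresholds, statistics, and wording below are invented stand-ins for the paper's richer color, lighting, and composition analysis:

```python
import numpy as np

def aesthetic_guidance(input_img, guide_img):
    """Toy ALG-I-style guidance: compare simple lighting and color
    statistics against a guidance image and phrase the gap in words."""
    tips = []
    d_bright = guide_img.mean() - input_img.mean()  # lighting gap
    if abs(d_bright) > 0.05:
        tips.append("increase exposure" if d_bright > 0 else "reduce exposure")
    d_sat = guide_img.std(axis=-1).mean() - input_img.std(axis=-1).mean()
    if abs(d_sat) > 0.02:  # channel spread as a crude saturation proxy
        tips.append("boost color saturation" if d_sat > 0 else "mute the colors")
    return tips or ["color and lighting already match the guide"]

dark = np.full((4, 4, 3), 0.2)    # underexposed gray input
bright = np.full((4, 4, 3), 0.8)  # bright guidance image
print(aesthetic_guidance(dark, bright))  # ['increase exposure']
```

A composition rule (e.g., a rule-of-thirds check on a detected subject) would slot into the same compare-then-verbalize pattern.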
In recent years, image generation has made great strides in improving image quality, producing high-fidelity results. In addition, there have recently been architecture designs that enable GANs to learn, in an unsupervised manner, the semantic attributes represented in different layers. However, there is still a lack of research on generating face images that are more consistent with human aesthetics. Based on EigenGAN [He et al., ICCV 2021], we build reinforcement learning techniques into the generator of EigenGAN. The agent tries to figure out how to alter the semantic attributes of the generated human faces toward more preferable ones. To accomplish this, we trained an aesthetics scoring model that can perform facial beauty prediction. We can also utilize this scoring model to analyze the correlation between face attributes and aesthetics scores. Empirically, off-the-shelf reinforcement learning techniques do not work well here. We therefore propose a new variant that incorporates components which have emerged in the reinforcement learning community in recent years. Compared to the originally generated images, the adjusted ones show clear distinctions concerning various attributes. Experimental results, using MindSpore, show the effectiveness of the proposed method. The altered facial images are generally more attractive, with significantly improved aesthetic levels.
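The score-driven adjustment loop can be illustrated without the RL machinery: perturb the semantic latent code and keep changes that the aesthetics scorer rewards. The paper proposes a reinforcement-learning variant; simple hill climbing with a stand-in scorer is used here purely to show the search structure:

```python
import numpy as np

def improve_latent(z, score_fn, steps=50, step_size=0.1, seed=0):
    """Toy stand-in for the attribute-adjustment loop: perturb the
    semantic latent code and keep changes that raise the aesthetics
    score."""
    rng = np.random.default_rng(seed)
    best, best_score = z.copy(), score_fn(z)
    for _ in range(steps):
        cand = best + step_size * rng.normal(size=z.shape)
        s = score_fn(cand)
        if s > best_score:  # greedy: keep only improving edits
            best, best_score = cand, s
    return best, best_score

target = np.array([1.0, -0.5, 0.3])           # stand-in "most beautiful" latent
score = lambda z: -np.sum((z - target) ** 2)  # stand-in aesthetics scorer
z0 = np.zeros(3)
z1, s1 = improve_latent(z0, score)
print(s1 > score(z0))  # True
```

In the real system the scorer is the trained facial beauty predictor and the latent dimensions are EigenGAN's layer-wise semantic directions, so each accepted step corresponds to an interpretable attribute edit.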
With the continuous development of social software and multimedia technology, images have become an important carrier for spreading information and socializing. How to evaluate an image comprehensively has become the focus of recent research. Traditional image aesthetics assessment methods often adopt a single numerical overall score, which carries a certain subjectivity and can no longer meet higher aesthetic requirements. In this paper, we construct a new image attribute dataset called the Aesthetic Mixed Dataset with Attributes (AMD-A) and design external attribute features for fusion. Furthermore, we propose an efficient method for image aesthetics attribute assessment on the mixed multi-attribute dataset and construct a multi-task network architecture using EfficientNet-B0 as the backbone. Our model can achieve aesthetics classification, overall scoring, and attribute scoring. In each sub-network, we improve the feature extraction with an ECA channel attention module. For the final overall scoring, we adopt the idea of a teacher-student network and use the classification sub-network to guide the fine-grained regression of the overall aesthetic score. Experimental results, using MindSpore, show that our proposed method can effectively improve the performance of overall aesthetic and attribute assessment.
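The classification-guides-regression idea for the overall score can be sketched as follows: the classification sub-network's distribution over coarse score bins yields an expected score, which fine-grained regression then refines. The bin centers, logits, and offset below are illustrative, not the paper's design:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classification_guided_score(cls_logits, bin_centers, reg_offset):
    """Coarse expected score from the classifier, refined by regression."""
    probs = softmax(cls_logits)          # coarse quality distribution
    coarse = float(probs @ bin_centers)  # expected score over the bins
    return coarse + float(reg_offset)    # fine-grained correction

bins = np.array([2.0, 4.0, 6.0, 8.0])    # score-bin centers (illustrative)
logits = np.array([0.0, 0.0, 3.0, 0.0])  # classifier favors the 6.0 bin
score = classification_guided_score(logits, bins, reg_offset=0.25)
print(round(score, 2))  # 6.08: pulled toward the 6.0 bin, then refined
```

The coarse term keeps the regression anchored to the classifier's judgment, which is the teacher-student coupling described above.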
This study uses TikTok (n = 8,173) to examine how short-form video platforms cover the protest paradigm in the recent Black Lives Matter movement. Computer-mediated visual analysis, i.e., computer vision, was adopted to identify the presence of four visual frames of protest (riot, confrontation, spectacle, and debate) in the multimedia content. Results of descriptive statistics and t-tests show that the three delegitimizing frames (riot, confrontation, and spectacle) were rarely found on TikTok, whereas the debate frame, which empowers marginalized communities, dominated the public sphere. However, although the three delegitimizing frames gained low social media visibility, as measured by views, likes, shares, followers, and duration, legitimizing elements, such as the debate frame, minority identities, and unofficial sources, were generally not favored by TikTok audiences either. This study concludes that while short-form video platforms may challenge the protest paradigm on the content creators' side, audience preferences as measured by social media visibility may still align with the protest paradigm.
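The t-tests referred to above compare visibility metrics between groups of videos; Welch's unequal-variance form is a common choice and is small enough to write out (the view counts below are invented for illustration only):

```python
import math

def welch_t(a, b):
    """Welch's two-sample t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb                        # squared standard error
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Illustrative view counts for videos using two different frames.
debate = [120, 150, 130, 170, 160]
riot = [80, 95, 70, 90, 100]
t, df = welch_t(debate, riot)
print(t > 0)  # the debate-frame group has the higher mean
```

Unlike Student's t-test, the Welch form does not assume equal variances across groups, which suits skewed engagement metrics.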
Recent works have shown that 3D-aware GANs trained on unstructured single-image collections can generate multi-view images of novel instances. The key underpinnings are a 3D radiance field generator and a volume rendering process. However, due to the high computational cost of neural volume rendering, existing methods either cannot generate high-resolution images (e.g., beyond 256x256) or rely on 2D CNNs for image-space upsampling, which jeopardizes 3D consistency across different views. This paper proposes a novel 3D-aware GAN that can generate high-resolution images (up to 1024x1024) while keeping the strict 3D consistency of volume rendering. Our motivation is to perform super-resolution directly in 3D space to preserve 3D consistency. We avoid the otherwise prohibitive computational cost by applying 2D convolutions on the set of 2D radiance manifolds defined in the recent generative radiance manifold (GRAM) approach, and apply dedicated loss functions for effective GAN training at high resolution. Experiments on the FFHQ and AFHQv2 datasets show that our method can produce high-quality 3D-consistent results that significantly outperform existing methods.
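The key move (super-resolving in image space per manifold, then compositing the manifolds as volume rendering would) can be sketched with toy layers. Nearest-neighbor upsampling stands in for the learned 2D convolutions, and the front-to-back alpha compositing mimics the rendering integral; sizes and values are illustrative:

```python
import numpy as np

def upsample_and_composite(manifold_rgba, scale=2):
    """Toy sketch of the GRAM-HD idea: super-resolve each 2D radiance
    manifold in image space, then alpha-composite the manifolds
    front to back as volume rendering would."""
    out, trans = None, None
    for rgba in manifold_rgba:  # manifolds ordered front to back
        up = np.repeat(np.repeat(rgba, scale, axis=0), scale, axis=1)
        rgb, alpha = up[..., :3], up[..., 3:4]
        if out is None:
            out = np.zeros_like(rgb)
            trans = np.ones_like(alpha)  # accumulated transmittance
        out += trans * alpha * rgb       # light reaching the camera
        trans *= (1.0 - alpha)           # remaining transmittance
    return out

layers = [np.full((4, 4, 4), v) for v in (0.6, 0.3)]  # two RGBA manifolds
img = upsample_and_composite(layers)
print(img.shape)  # (8, 8, 3)
```

Because the upsampling happens before compositing, every view is rendered from the same super-resolved 3D representation, which is why consistency survives at 1024x1024.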